Skip to content

Fix IndexStatsIT#testThrottleStats #128049

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

arteam
Copy link
Contributor

@arteam arteam commented May 13, 2025

Engine.PauseLock#throttle can be called when the lock is being throttled, so we can't guarantee that all permits are available before throttling.

Resolve #126359
See #127173

`Engine.PauseLock#throttle` can be called when the lock is being throttled,
so we can't guarantee that all permits are available before throttling.

Resolve elastic#126359
See elastic#127173
@arteam arteam added >test Issues or PRs that are addressing/adding tests :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. labels May 13, 2025
@arteam arteam marked this pull request as ready for review May 13, 2025 19:16
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing Meta label for Distributed Indexing team label May 13, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@ankikuma
Copy link
Contributor

ankikuma commented May 20, 2025

Hi Artem, sorry for taking so long to get to this. I am not convinced this is the right fix. We do not call throttle and unthrottle directly on the lock. We go through InternalEngine#activateThrottling and InternalEngine#deactivateThrottling instead, which update a counter to make sure that the throttle and unthrottle calls are matched.
Basically, the callers are expected to follow the semantics that for each activateThrottling call there should be a matching deactivateThrottling call and we throw asserts in the code if this is not the case.
I think the problem might be in the way the ThreadPoolMergeScheduler is calling activateThrottling and deactivateThrottling.

But I am not sure yet where the mismatch might be caused.

@ankikuma ankikuma requested a review from albertzaharovits May 20, 2025 18:54
@ankikuma
Copy link
Contributor

I believe there is a race condition between IndexThrottle#activate() and IndexThrottle#deactivate() because both can be called by different threads.

@henningandersen
Copy link
Contributor

I believe there is a race condition between IndexThrottle#activate() and IndexThrottle#deactivate() because both can be called by different threads.

This makes sense to me too, I think the test could be rewritten to use those handles and provoke the issue.

I agree the fix needs to be done differently, we might need to have a lock inside activeate/deactivateThrottling (though other solutions can be done too, but seems more complex and this should not be a heavily called mechanism).

@albertzaharovits
Copy link
Contributor

albertzaharovits commented May 27, 2025

I believe there is a race condition between IndexThrottle#activate() and IndexThrottle#deactivate() because both can be called by different threads.

But the ThreadPoolMergeScheduler calls the throttling methods under a lock:

. What I think is going is a cycle of IndexThrottle#activate -> IndexThrottle#deactivate -> IndexThrottle#activate where some IndexThrottle#acquireThrottle from the first activate haven't released yet, before the second one. If that's the case, I believe Artem's fix proposal, to remove the assert, should fix this test failure (but I don't think that the added test exercises the fault here).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. Team:Distributed Indexing Meta label for Distributed Indexing team >test Issues or PRs that are addressing/adding tests v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] IndexStatsIT testThrottleStats failing
5 participants